-
Notifications
You must be signed in to change notification settings - Fork 2.8k
feat(evaluation): Add CJK tokenizer support for ROUGE-1 evaluation #4143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Fixes google#4122 The default ROUGE tokenizer only recognizes ASCII alphanumeric characters, causing ROUGE-1 scores to be 0.0 for CJK (Chinese, Japanese, Korean) text. Changes: - Add CJKTokenizer class that handles CJK characters individually while preserving word-based tokenization for ASCII alphanumeric characters - Add RougeScoreCriterion to allow explicit tokenizer specification - Update RougeEvaluator to support custom tokenizers - Add warning when CJK text is detected without CJK tokenizer - Preserve backward compatibility: default behavior unchanged Usage: criterion = RougeScoreCriterion(threshold=0.8, tokenizer='cjk') eval_metric = EvalMetric( metric_name='response_match_score', threshold=0.8, criterion=criterion, ) Limitations: - Fullwidth alphanumeric (A-Z, 0-9) are skipped - Greek, Cyrillic, and other non-CJK scripts are skipped - This is character-based tokenization, not morphological analysis
Summary of ChangesHello @maru0804, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the ROUGE-1 evaluation framework by introducing robust support for CJK (Chinese, Japanese, Korean) languages. Previously, the default ROUGE tokenizer's ASCII-only nature led to inaccurate, often zero, scores for CJK text. The changes address this by providing an opt-in CJK-specific tokenizer and a mechanism to configure it, ensuring that evaluations involving multilingual content yield meaningful results. The update also includes a helpful warning system to alert users when CJK text is present but no appropriate tokenizer is configured, promoting correct usage. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request does an excellent job of adding CJK language support for ROUGE-1 evaluation, addressing a significant limitation of the default tokenizer. The introduction of CJKTokenizer and RougeScoreCriterion is well-designed, and the changes are thoughtfully integrated to ensure backward compatibility. The addition of comprehensive unit tests covering the new tokenizer, scoring logic, and warning behavior is particularly commendable. My review includes a couple of suggestions to improve performance and code conciseness, but overall, this is a high-quality contribution.
…instantiation Address Gemini Code Assist review feedback: 1. Simplify CJKTokenizer.tokenize() using re.findall instead of manual loop - More concise and idiomatic Python - Same functionality with fewer lines of code 2. Optimize RougeScorer instantiation for performance - Create RougeScorer once in __init__ instead of per invocation - Reuse self._scorer across all evaluate_invocations calls - Avoids unnecessary object creation in loops
|
Hi @maru0804, Thank you for your contribution! We appreciate you taking the time to submit this pull request. Your PR has been received by the team and is currently under review. We will provide feedback as soon as we have an update to share. |
|
Hi @ankursharmas , can you please review this. |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for CJK languages in ROUGE-1 evaluation by adding an optional CJK-aware tokenizer. The changes are well-structured, with a new RougeScoreCriterion for configuration, a CJKTokenizer implementation, and updates to RougeEvaluator to use it. The refactoring of RougeEvaluator to initialize the scorer once is a good performance improvement. The new functionality is backed by a comprehensive set of unit tests covering tokenization, score calculation, and the new warning mechanism for misconfigurations. My review includes a couple of suggestions to improve code clarity and test robustness. Overall, this is a high-quality contribution.
| tokenizer: Optional[tokenizers.Tokenizer] = None | ||
| use_stemmer = True | ||
|
|
||
| if eval_metric.criterion: | ||
| try: | ||
| criterion = RougeScoreCriterion.model_validate( | ||
| eval_metric.criterion.model_dump() | ||
| ) | ||
| if criterion.tokenizer == "cjk": | ||
| tokenizer = CJKTokenizer() | ||
| use_stemmer = False # Stemming not applicable to CJK | ||
| except ValidationError: | ||
| pass # Different criterion type, ignore | ||
|
|
||
| # Create scorer once for reuse across invocations (performance optimization) | ||
| if tokenizer: | ||
| self._scorer = rouge_scorer.RougeScorer( | ||
| ["rouge1"], use_stemmer=False, tokenizer=tokenizer | ||
| ) | ||
| self._has_cjk_tokenizer = True | ||
| else: | ||
| self._scorer = rouge_scorer.RougeScorer( | ||
| ["rouge1"], use_stemmer=use_stemmer | ||
| ) | ||
| self._has_cjk_tokenizer = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for determining whether to use a stemmer can be simplified. The use_stemmer variable is only ever True when it's used to initialize RougeScorer in the else branch. The assignment use_stemmer = False is redundant because when a custom tokenizer is provided, use_stemmer is hardcoded to False during scorer creation anyway.
You can simplify this block to improve clarity and remove the redundant variable and its associated logic.
tokenizer: Optional[tokenizers.Tokenizer] = None
if eval_metric.criterion:
try:
criterion = RougeScoreCriterion.model_validate(
eval_metric.criterion.model_dump()
)
if criterion.tokenizer == "cjk":
tokenizer = CJKTokenizer()
except ValidationError:
pass # Different criterion type, ignore
# Create scorer once for reuse across invocations (performance optimization)
if tokenizer:
self._scorer = rouge_scorer.RougeScorer(
["rouge1"], use_stemmer=False, tokenizer=tokenizer
)
self._has_cjk_tokenizer = True
else:
self._scorer = rouge_scorer.RougeScorer(
["rouge1"], use_stemmer=True
)
self._has_cjk_tokenizer = False| def test_tokenize_korean(self): | ||
| tokenizer = CJKTokenizer() | ||
| tokens = tokenizer.tokenize("테스트") | ||
| assert len(tokens) == 3 # 3 Hangul syllables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion is correct, but it could be more specific to prevent potential regressions. Checking only the length of the token list might not catch issues where the tokenization is incorrect but produces the right number of tokens (e.g., ['테', '스', 'x']). It would be more robust to assert the exact content of the token list.
| assert len(tokens) == 3 # 3 Hangul syllables | |
| assert tokens == ["테", "스", "트"] |
Summary
Fixes #4122
The default ROUGE tokenizer only recognizes ASCII alphanumeric characters (
[a-z0-9]), causing ROUGE-1 scores to be 0.0 for CJK (Chinese, Japanese, Korean) text. This PR adds CJK language support through an opt-in tokenizer.Changes
New Features
CJKTokenizer: A character-based tokenizer for CJK languages that:RougeScoreCriterion: New criterion class to specify tokenizer optionsModified
RougeEvaluator: Updated to support custom tokenizers and log a warning (once) when CJK text is detected without proper tokenizer configurationResponseEvaluator: Updated to passeval_metric(including criterion) toRougeEvaluatorUsage
Backward Compatibility
Limitations (documented in docstrings)
Test Coverage
Added 29 new tests covering: